Homework 1 - PQHS 471
1 Question 1
1.1 Loading in Data
q1<-read.table("~/Downloads/commu_crime_na_filled.txt", header=TRUE)
q1<-q1[,-90]
q1.names<-read.csv("~/Downloads/communities.data", header=FALSE)
1.2 Hierarchical Clustering
Our first method will be hierarchical clustering. I will set a seed (4) for this and future manipulations to ensure reproducibility when this .Rmd is knit.
The chunk below scales the data, creates a distance matrix, and then creates a dendrogram using complete linkage.
The code also creates objects containing the hclust fit and the assigned cluster for each city in the data set. I chose to cut the dendrogram at a height of 4.2, which creates four clusters.
set.seed(4)
sd.data = scale(q1)
data.dist = dist(sd.data)
hc.out = hclust(data.dist)
hc.clusters = cutree(hc.out, h = 4.2)
plot(hc.out, main = "Complete Linkage", labels = q1.names[,4],
xlab = "", sub = "", ylab = "", cex = 0.5)
abline(h = 4.2, col = "red")

1.3 K-means Clustering
My next method will be k-means clustering. In this method, we have to choose the number of clusters that we want to end up with. Given that we ended up with four clusters from hierarchical clustering, it seems appropriate to create four clusters with the other methods (where possible).
1.3.1 PCA for Question 1 - Axes for K-means Clustering Plot
Given the multi-dimensionality of the data, it makes sense to perform principal component analysis (PCA) on the data set to make it easier to display the clusters visually.
The code below will perform PCA on the data set. It will also tell us a fair amount of information about the resultant principal components.
set.seed(4)
pr.out=prcomp(q1, scale=TRUE)
pr.var=pr.out$sdev^2
pve=pr.var/sum(pr.var)
pca.names<- 1:89
head(data.frame(pca.names, pve*100),10)%>%
mutate(across(where(is.numeric), ~ round(., 1)))%>%
kbl(
caption = "<center><strong>PCA Variance Explanation</strong></center>",
col.names = c("PC #",
"Variance Explained (%)")) %>%
kable_classic(full_width = F, html_font = "Cambria")

| PC # | Variance Explained (%) |
|---|---|
| 1 | 28.1 |
| 2 | 17.1 |
| 3 | 8.8 |
| 4 | 7.4 |
| 5 | 4.2 |
| 6 | 3.6 |
| 7 | 3.2 |
| 8 | 2.2 |
| 9 | 1.7 |
| 10 | 1.7 |
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type='b')

So, we can see from the plot above that the first principal component accounts for ~28% of the variability between cities in the data set, the second accounts for 17.1%, the third 8.8%, and so on.
The plot also shows how the contributions of the individual principal components add up to account for the variance in the data set.
Now, let’s get to k-means clustering.
The code below will create k-means clusters out of our data and create plots to display how the cities in the data set were clustered with this method. The first plot shows the data in two dimensions plotted against the first two principal components, and the second plot shows the data plotted against the first three principal components.
set.seed(4)
km.out =kmeans(q1, 4, nstart = 20)
fviz_cluster(km.out, data = q1,
# palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw()
)

comp<-data.frame(pr.out$x[,1:3])
library(plotly)
plot_ly(x=comp$PC1, y=comp$PC2, z=comp$PC3, type="scatter3d", mode="markers", color=km.out$cluster)

Make note of this pattern, because I’ll display the results of each clustering method this way so that we can make a comparison.
For starters, I’ll show the comparable plot for the hierarchical clusters we created previously on the first two principal components to see if there are differences:
fviz_cluster(list(data = q1, cluster=hc.clusters),
ellipse.type = "norm",
ellipse.level = 0.68,
geom = "point",
palette = "jco",
ggtheme = theme_minimal())

There are similarities, but also clear differences.
Note the differences in the shapes of the clusters. One of the most notable differences is that in k-means clustering, cities with PC2 < -5 and PC1 < 5 are almost entirely clustered together. With hierarchical clustering, there is a cluster that contains almost all of the cities with a PC1 value less than -5.
Now let’s move on to our last clustering method.
1.4 Fuzzy Clustering
Soft or fuzzy clustering is a form of clustering in which each data point is partially assigned to more than one cluster depending on its features. A membership grade is assigned to each data point based on the degree to which it belongs to each cluster. Points on the edge of a cluster, with lower membership grades, belong to that cluster to a lesser degree than points at its center. Essentially, this leaves room for grey area in how well a point ‘fits’. Hard clustering, by contrast, forces each data point into exactly one cluster, even when its features would fit others nearly as well.
The code below will use cmeans() to create fuzzy clusters in the data set. I use the corrplot function below to show the first 100 data points and how well they fit into each of the four clusters. The darker the dot, the more that data point belongs in the indicated cluster.
set.seed(4)
library(e1071)
fuzz<-cmeans(q1, 4)
library(corrplot)
par(mfrow=c(1,4))
corrplot(head(fuzz$membership,25), is.corr = FALSE)
corrplot(fuzz$membership[26:50,], is.corr = FALSE)
corrplot(fuzz$membership[51:75,], is.corr = FALSE)
corrplot(fuzz$membership[76:100,], is.corr = FALSE)

Below is the comparable plot of the data points assigned to their respective soft clusters, plotted on the principal components we created. Note that there are clear differences between these and the clusters we created using k-means and hierarchical clustering.
fviz_cluster(list(data = q1, cluster=fuzz$cluster),
ellipse.type = "norm",
ellipse.level = 0.68,
geom = "point",
palette = "jco",
ggtheme = theme_bw())

1.5 Clusters by Method
Okay, now let’s compare the patterns of how data points were assigned to clusters using each of the methods. The table below includes the first 25 cities in the data set and their assigned cluster using each method.
What is important here is NOT which numerical cluster each city was assigned to, but the patterns of which cities were clustered together. Looking at the table, it is apparent that some cities were clustered together by one or two methods, but not by another. This reinforces that there are differences in how the data points were clustered by the different methods.
clust<-data.frame(q1.names[,4], km.out$cluster, hc.clusters, fuzz$cluster)
head(clust,25)%>%
kbl(
caption = "<center><strong>Clustering by Method</strong></center>",
col.names = c("","K-Means Clustering",
"Hierarchical Clustering",
"Fuzzy Clustering")) %>%
kable_classic(full_width = F, html_font = "Cambria")

| | K-Means Clustering | Hierarchical Clustering | Fuzzy Clustering |
|---|---|---|---|
| Lakewoodcity | 3 | 1 | 3 |
| Tukwilacity | 3 | 2 | 3 |
| Aberdeentown | 2 | 1 | 2 |
| Willingborotownship | 4 | 3 | 4 |
| Bethlehemtownship | 3 | 1 | 4 |
| SouthPasadenacity | 4 | 3 | 4 |
| Lincolntown | 3 | 1 | 4 |
| Selmacity | 1 | 4 | 2 |
| Hendersoncity | 2 | 4 | 2 |
| Claytoncity | 4 | 3 | 4 |
| DalyCitycity | 1 | 2 | 4 |
| RockvilleCentrevillage | 4 | 3 | 4 |
| Needhamtown | 4 | 3 | 4 |
| GrandChutetown | 3 | 1 | 4 |
| DanaPointcity | 4 | 3 | 4 |
| FortDodgecity | 2 | 1 | 2 |
| Albanycity | 2 | 4 | 2 |
| Denvilletownship | 4 | 3 | 4 |
| Valparaisocity | 3 | 1 | 3 |
| Rostravertownship | 3 | 1 | 2 |
| Modestocity | 1 | 2 | 3 |
| Jacksonvillecity | 2 | 4 | 2 |
| KlamathFallscity | 2 | 1 | 2 |
| SiouxCitycity | 2 | 1 | 2 |
| Delanocity | 2 | 4 | 2 |
Last, we can see in the tabyl below that there are differences in how cities were assigned to clusters using k-means and hierarchical methods within each soft cluster. The heterogeneity in numbers of cities assigned to clusters (note totals) again speaks to differences in resultant clusters using the different methods.
clust%>%
select(km.out.cluster,hc.clusters, fuzz.cluster)%>%
rename("K-Means"= km.out.cluster,"Hierarchical"= hc.clusters, "Fuzzy" = fuzz.cluster )%>%
tabyl(`K-Means`, Hierarchical, Fuzzy)%>%
adorn_totals(where = c("row", "col")) %>%
adorn_title(placement = "combined") %>%
kbl(caption="<center><strong>Clustering by Method (Fuzzy Cluster 1-4 Displayed from Left to Right)</strong></center>", longtable = TRUE)%>%
kable_paper(html_font = "Cambria", latex_options="scale_down")
2 Question 2
2.2 PCA for Question 2
Let’s start by performing PCA on the data. The code below creates objects out of the PCA analysis on the data set, the variance explained by each principal component, and the percentage of the variance in the whole data set that each principal component explains.
Now let’s take a look at these ideas visually. The first plot below shows the proportion of variance in the data set that the first ten principal components explain. This information is also shown in a table for ease of reading.
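The chunk that produced these objects does not appear in this render; a sketch of what it presumably looked like, mirroring the Question 1 setup (assuming the Question 2 data are already loaded as q2), is:

```r
# PCA on the Question 2 data; scale = TRUE standardizes each gene first
set.seed(4)
pr.out = prcomp(q2, scale = TRUE)
pr.var = pr.out$sdev^2            # variance explained by each PC
pve = pr.var / sum(pr.var)        # proportion of total variance explained
pca.names <- 1:length(pve)        # PC index labels for the table below
# First plot: proportion of variance explained by the first ten PCs
plot(pve[1:10], xlab = "Principal Component",
     ylab = "Proportion of Variance Explained", type = "b")
```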
head(data.frame(pca.names, pve*100),10)%>%
mutate(across(where(is.numeric), ~ round(., 1)))%>%
kbl(
caption = "<center><strong>PCA Variance Explanation</strong></center>",
col.names = c("PC #",
"Variance Explained (%)")) %>%
kable_classic(full_width = F, html_font = "Cambria")

| PC # | Variance Explained (%) |
|---|---|
| 1 | 8.1 |
| 2 | 3.4 |
| 3 | 3.3 |
| 4 | 3.2 |
| 5 | 3.1 |
| 6 | 3.0 |
| 7 | 2.9 |
| 8 | 2.9 |
| 9 | 2.9 |
| 10 | 2.9 |
Last, the code below will create a biplot showing each patient plotted against the first two principal components. I see a clear pattern here - don’t you? It looks like there is a wide gap on the first principal component axis between the first twenty patients (healthy) and second twenty patients (diseased).
fviz_pca_ind(pr.out, col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)

2.3 Clustering Question 2 - Hierarchical Clustering
Now let’s move on to hierarchical clustering.
2.3.1 Correlation-Based Distance
The code below scales the data and creates an object containing the correlation-based distance before creating a dendrogram.
comp<-data.frame(pr.out$x[,1:4])
set.seed(4)
sd.data =scale(q2)
data.dist = get_dist(sd.data, method="pearson")
plot(hclust(data.dist, method = "complete"), main = "Complete Linkage",
xlab = "", sub = "", ylab = "", cex = 0.5)
abline(h = 1.09, col = "red")

We can see that the first twenty patients and last twenty patients are clustered separately, splitting at a height of ~1.09. Interesting that there are differences between healthy and diseased patients…
Just to verify this notion, let’s check and see which patients ended up in which cluster. I’ve created a variable to indicate whether patients are healthy or diseased and juxtaposed it to their assigned cluster.
hc.out =hclust(data.dist)
hc.clusters = cutree(hc.out, 2)
pt.type<-data.frame(row(q2)[,1],hc.clusters)%>%
mutate(patient = case_when(
(row_number() %in% 1:20)~ "Healthy",
TRUE~ "Diseased"
))
pt.type%>%
select(patient, hc.clusters)%>%
kbl(
caption = "<center><strong>Healthy versus Diseased Clustering</strong></center>",
col.names = c("Patient Type",
"Cluster")) %>%
kable_classic(full_width = F, html_font = "Cambria")

| Patient Type | Cluster |
|---|---|
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Healthy | 1 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| Diseased | 2 |
| | 1 | 2 |
|---|---|---|
| Diseased | 0 | 20 |
| Healthy | 20 | 0 |
Well, that settles it. The differences between healthy and diseased patients were enough to lead to their being clustered separately. There doesn’t appear to be any overlap.
Interestingly, the first principal component appears to separate the clusters best. The two 3D plots below plot the clusters against PC1, PC2, and PC3 and then PC2, PC3, and PC4. The clusters separate nicely on PC1 in the first plot, but then become mixed without any obvious pattern in the second. Whatever is in PC1 must be important to the differences that exist in healthy and diseased patients.
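The interactive 3D plots do not survive in this render; chunks along these lines (assuming the comp data frame of PC1 through PC4 scores and the hc.clusters assignments created above) would reproduce them:

```r
library(plotly)
# Clusters against PC1, PC2, PC3: the groups separate cleanly along PC1
plot_ly(x = comp$PC1, y = comp$PC2, z = comp$PC3,
        type = "scatter3d", mode = "markers", color = hc.clusters)
# Clusters against PC2, PC3, PC4: without PC1, the groups mix together
plot_ly(x = comp$PC2, y = comp$PC3, z = comp$PC4,
        type = "scatter3d", mode = "markers", color = hc.clusters)
```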
2.3.2 Comparing Linkage Types
The code below will perform hierarchical clustering using complete, average, and single linkage methods. All of the resultant dendrograms show a similar pattern in clustering - healthy versus diseased. To prove this, see the table created below indicating each patient, their status (healthy or diseased), and their assigned cluster using each linkage strategy. The pattern is the same across the board.
In sum, no matter what kind of linkage we use, the result is the same.
set.seed(4)
hc.complete=hclust(data.dist, method = "complete")
hc.average=hclust(data.dist, method="average")
hc.single=hclust(data.dist, method="single")
par(mfrow=c(1,3))
plot(hc.complete,main="Complete Linkage", xlab="", sub="", cex=.9)
plot(hc.average, main="Average Linkage", xlab="", sub="", cex=.9)
plot(hc.single, main="Single Linkage", xlab="", sub="", cex=.9)

hc.co<-cutree(hc.complete, 2)
hc.av<-cutree(hc.average, 2)
hc.si<-cutree(hc.single, 2)
pt.type<-data.frame(row(q2)[,1],hc.clusters, hc.co, hc.av, hc.si)%>%
mutate(patient = case_when(
(row_number() %in% 1:20)~ "Healthy",
TRUE~ "Diseased"
))
pt.type%>%
select(patient, hc.co, hc.av, hc.si)%>%
kbl(
caption = "<center><strong>Healthy versus Diseased Clustering</strong></center>",
col.names = c("Patient Type","Complete Linkage", "Average Linkage", "Single Linkage")) %>%
kable_classic(full_width = F, html_font = "Cambria")

| Patient Type | Complete Linkage | Average Linkage | Single Linkage |
|---|---|---|---|
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Healthy | 1 | 1 | 1 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
| Diseased | 2 | 2 | 2 |
2.4 Bonus
Interesting question.
It is clear that PC1 creates the best differentiation between healthy and diseased patients. It seems that assessing each gene’s contribution to PC1 may be a reasonable metric for finding the genes that differ the most between healthy and diseased patients.
Let’s look at the plots below. It appears that a small proportion of genes (~100, by the appearance of the second plot) have moderate rotations in PC1, while the remaining genes have small rotations.
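The rendered plots are not preserved here; a sketch that would produce comparable views (assuming pr.out from the Question 2 PCA) plots the absolute PC1 rotation for each gene, first by gene index and then sorted:

```r
rot1 <- abs(pr.out$rotation[, 1])   # absolute PC1 loading for each gene
par(mfrow = c(1, 2))
plot(rot1, xlab = "Gene", ylab = "|PC1 Rotation|")
# Sorted view: roughly 100 genes stand out with moderate rotations
plot(sort(rot1, decreasing = TRUE), type = "l",
     xlab = "Rank", ylab = "|PC1 Rotation|")
```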
As a control metric, I created a data frame of the differences in each gene’s average expression in healthy and diseased patients. I then compared these differences to the absolute value of the same gene’s rotation in PC1.
The table below is long (sorry), but I wanted to make the point that there appears to be a relationship between a gene’s rotation in PC1 and the mean difference in gene expression between healthy and diseased patients as calculated in the data frame below. Note that both steadily decrease as you move down the table.
q2b<-t(q2)
q2b1<-data.frame(q2b[,1:20])
q2b2<-data.frame(q2b[,21:40])
q2bdiff<-rowMeans(q2b1)-rowMeans(q2b2)
q2bdiff.mean<-data.frame(1:1000, abs(q2bdiff),abs(pr.out$rotation[,1]))
names(q2bdiff.mean)<- c("gene", "mean_diff", "pca_rot")
head(q2bdiff.mean[order(-q2bdiff.mean$pca_rot),],150)%>%
kbl(
caption = "<center><strong>Mean Differences Between Diseased and Healthy and PC1 Rotations</strong></center>",
col.names = c("Gene","Mean Differences", "PC1 Rotation")) %>%
kable_classic(full_width = T, html_font = "Cambria")

| | Gene | Mean Differences | PC1 Rotation |
|---|---|---|---|
| 502 | 502 | 2.5444609 | 0.0948504 |
| 589 | 589 | 2.3434914 | 0.0944977 |
| 565 | 565 | 2.4708205 | 0.0918382 |
| 590 | 590 | 2.2053241 | 0.0917317 |
| 600 | 600 | 2.7475772 | 0.0916732 |
| 551 | 551 | 2.3428449 | 0.0876836 |
| 593 | 593 | 2.4020959 | 0.0875862 |
| 538 | 538 | 2.3919786 | 0.0874540 |
| 584 | 584 | 2.6019855 | 0.0869086 |
| 509 | 509 | 2.3531934 | 0.0866102 |
| 511 | 511 | 2.3132558 | 0.0865513 |
| 570 | 570 | 2.2706461 | 0.0862646 |
| 599 | 599 | 2.4138156 | 0.0862431 |
| 535 | 535 | 2.2402204 | 0.0861009 |
| 528 | 528 | 1.9924967 | 0.0856143 |
| 592 | 592 | 2.1842740 | 0.0855861 |
| 566 | 566 | 2.1213545 | 0.0855301 |
| 564 | 564 | 2.1537022 | 0.0855207 |
| 540 | 540 | 2.5451736 | 0.0855074 |
| 508 | 508 | 2.0809369 | 0.0854137 |
| 554 | 554 | 2.4367177 | 0.0853939 |
| 11 | 11 | 2.0940957 | 0.0852155 |
| 548 | 548 | 2.2571178 | 0.0851578 |
| 536 | 536 | 2.2670467 | 0.0851402 |
| 563 | 563 | 1.9205127 | 0.0849285 |
| 588 | 588 | 2.0072099 | 0.0847297 |
| 555 | 555 | 2.0435628 | 0.0845421 |
| 520 | 520 | 1.9154914 | 0.0843101 |
| 522 | 522 | 2.1465577 | 0.0841999 |
| 558 | 558 | 1.8908142 | 0.0839016 |
| 574 | 574 | 2.1891454 | 0.0836927 |
| 559 | 559 | 1.9134076 | 0.0836331 |
| 530 | 530 | 1.9088348 | 0.0835904 |
| 569 | 569 | 1.9761596 | 0.0834494 |
| 514 | 514 | 2.0745379 | 0.0833856 |
| 15 | 15 | 2.1306259 | 0.0833707 |
| 561 | 561 | 2.4004469 | 0.0831013 |
| 513 | 513 | 2.4153434 | 0.0829114 |
| 16 | 16 | 1.9797135 | 0.0827456 |
| 575 | 575 | 2.0524602 | 0.0825844 |
| 505 | 505 | 2.1354602 | 0.0825725 |
| 503 | 503 | 2.2140494 | 0.0825674 |
| 549 | 549 | 2.5507567 | 0.0820319 |
| 582 | 582 | 2.4960844 | 0.0814645 |
| 539 | 539 | 1.8609756 | 0.0813224 |
| 13 | 13 | 2.0229594 | 0.0813182 |
| 526 | 526 | 1.9735051 | 0.0812224 |
| 515 | 515 | 1.7318145 | 0.0810760 |
| 516 | 516 | 2.2567546 | 0.0810632 |
| 562 | 562 | 2.4655491 | 0.0808857 |
| 568 | 568 | 2.5194183 | 0.0808201 |
| 547 | 547 | 1.9936445 | 0.0804744 |
| 527 | 527 | 1.9564579 | 0.0804677 |
| 12 | 12 | 2.3166049 | 0.0802807 |
| 571 | 571 | 2.1393023 | 0.0802315 |
| 534 | 534 | 1.9501233 | 0.0801885 |
| 579 | 579 | 1.9758579 | 0.0797707 |
| 523 | 523 | 2.2825173 | 0.0795457 |
| 541 | 541 | 2.0239611 | 0.0794231 |
| 507 | 507 | 1.9247132 | 0.0792445 |
| 550 | 550 | 1.9036654 | 0.0791731 |
| 18 | 18 | 1.8587970 | 0.0790526 |
| 504 | 504 | 1.8871367 | 0.0786392 |
| 545 | 545 | 2.0909676 | 0.0785524 |
| 529 | 529 | 2.3803677 | 0.0776381 |
| 586 | 586 | 2.0450671 | 0.0776283 |
| 533 | 533 | 2.0288908 | 0.0775131 |
| 544 | 544 | 1.8297158 | 0.0774975 |
| 521 | 521 | 2.0283827 | 0.0773548 |
| 531 | 531 | 1.8268448 | 0.0772049 |
| 595 | 595 | 1.8465195 | 0.0769854 |
| 512 | 512 | 1.8597888 | 0.0769505 |
| 596 | 596 | 2.0054760 | 0.0767437 |
| 517 | 517 | 1.9987917 | 0.0766804 |
| 546 | 546 | 2.2631532 | 0.0766770 |
| 542 | 542 | 1.9397265 | 0.0765169 |
| 587 | 587 | 2.0515892 | 0.0761105 |
| 583 | 583 | 2.0829797 | 0.0760889 |
| 598 | 598 | 1.8171639 | 0.0758403 |
| 591 | 591 | 1.9952959 | 0.0757735 |
| 557 | 557 | 1.6179701 | 0.0756431 |
| 501 | 501 | 1.7050061 | 0.0749214 |
| 580 | 580 | 1.5528136 | 0.0743236 |
| 524 | 524 | 2.1133090 | 0.0742861 |
| 572 | 572 | 2.0012904 | 0.0742220 |
| 20 | 20 | 1.9473131 | 0.0741622 |
| 553 | 553 | 1.6619197 | 0.0737646 |
| 17 | 17 | 1.8163671 | 0.0735899 |
| 532 | 532 | 1.6535325 | 0.0735879 |
| 597 | 597 | 1.7652763 | 0.0734921 |
| 519 | 519 | 1.4478523 | 0.0732986 |
| 567 | 567 | 1.9522874 | 0.0730010 |
| 556 | 556 | 1.8509786 | 0.0725488 |
| 506 | 506 | 1.8641219 | 0.0724897 |
| 577 | 577 | 1.8054775 | 0.0719603 |
| 537 | 537 | 1.4820046 | 0.0718737 |
| 576 | 576 | 1.9730217 | 0.0718200 |
| 560 | 560 | 1.8876578 | 0.0715854 |
| 14 | 14 | 1.8176984 | 0.0714753 |
| 19 | 19 | 1.6468829 | 0.0706200 |
| 552 | 552 | 1.6263043 | 0.0692426 |
| 525 | 525 | 1.5367276 | 0.0685103 |
| 543 | 543 | 1.7907337 | 0.0682978 |
| 581 | 581 | 1.5139918 | 0.0676986 |
| 585 | 585 | 1.5566643 | 0.0663186 |
| 578 | 578 | 1.5436369 | 0.0662102 |
| 594 | 594 | 1.4982538 | 0.0650860 |
| 510 | 510 | 1.5262318 | 0.0640798 |
| 156 | 156 | 1.0931812 | 0.0623729 |
| 518 | 518 | 1.5009283 | 0.0619118 |
| 573 | 573 | 1.6040790 | 0.0572774 |
| 135 | 135 | 0.9453316 | 0.0562318 |
| 967 | 967 | 0.9914602 | 0.0531267 |
| 448 | 448 | 0.9552971 | 0.0519146 |
| 615 | 615 | 1.0829394 | 0.0510663 |
| 853 | 853 | 0.9854869 | 0.0492812 |
| 939 | 939 | 0.7587715 | 0.0477184 |
| 172 | 172 | 0.8246742 | 0.0471786 |
| 243 | 243 | 0.8522348 | 0.0448446 |
| 393 | 393 | 0.6504216 | 0.0448403 |
| 116 | 116 | 0.7591049 | 0.0447638 |
| 457 | 457 | 0.7144575 | 0.0435036 |
| 331 | 331 | 0.8396971 | 0.0433897 |
| 98 | 98 | 0.5569672 | 0.0429602 |
| 715 | 715 | 0.7998315 | 0.0427917 |
| 462 | 462 | 0.8080792 | 0.0423691 |
| 751 | 751 | 0.6027722 | 0.0423344 |
| 389 | 389 | 0.6277691 | 0.0420895 |
| 77 | 77 | 0.7570385 | 0.0419401 |
| 900 | 900 | 0.5912002 | 0.0412228 |
| 860 | 860 | 0.6938482 | 0.0403874 |
| 752 | 752 | 0.7357372 | 0.0402600 |
| 293 | 293 | 0.7302956 | 0.0400526 |
| 802 | 802 | 0.6565890 | 0.0399415 |
| 61 | 61 | 0.6582499 | 0.0395674 |
| 337 | 337 | 0.6959353 | 0.0391795 |
| 768 | 768 | 0.7347575 | 0.0386251 |
| 686 | 686 | 0.6094919 | 0.0380952 |
| 326 | 326 | 0.5900243 | 0.0380079 |
| 346 | 346 | 0.5524822 | 0.0379718 |
| 104 | 104 | 0.6662910 | 0.0379223 |
| 263 | 263 | 0.6930379 | 0.0377150 |
| 142 | 142 | 0.7289325 | 0.0376839 |
| 373 | 373 | 0.6164435 | 0.0374662 |
| 49 | 49 | 0.7397824 | 0.0372169 |
| 680 | 680 | 0.6132496 | 0.0369390 |
| 214 | 214 | 0.7701767 | 0.0367594 |
| 634 | 634 | 0.6234604 | 0.0366687 |
| 852 | 852 | 0.6897570 | 0.0366071 |
| 675 | 675 | 0.5788977 | 0.0366056 |
Lastly, the plots below reinforce the idea that there is an obvious relationship between genes’ PC1 rotations and mean differences in expression between healthy and diseased patients. Note that both measures appear to be right-skewed. Thresholds of 1.3 in mean difference and 0.055 in PC1 rotation appear to distinguish the outlying genes from the rest of the pack.
library(patchwork)
p1<-ggplot(q2bdiff.mean, aes(x=mean_diff, y=pca_rot))+
geom_point()+
geom_smooth(method = "loess", formula = y ~ x, se = F) +
labs(title = "PC1 Rotation v Mean Difference",
x = "Mean Difference Between Diseased and Healthy", y = "Absolute Value of PC1 Rotation") +
geom_vline(xintercept =1.3)+
geom_hline(yintercept =0.055)+
theme(aspect.ratio = 1)+
theme_classic()
p2<-ggplot(q2bdiff.mean, aes(x=mean_diff, fill=mean_diff>1.3))+
geom_histogram()+
guides(fill="none")+
theme_bw()+
labs(title = "Histogram of Mean Differences Between\nHealthy and Diseased Patients", x="Mean Difference (absolute value)", y="Count")
p3<-ggplot(q2bdiff.mean, aes(x=pca_rot, fill = pca_rot>0.055))+
geom_histogram()+
guides(fill="none")+
theme_bw()+
labs(title = "Histogram of PC1 Rotations", x="PC1 Rotation (absolute value)", y="Count")
p1 / (p2 + p3)

Using these thresholds, we identified 110 genes that might be of interest. They are displayed below, along with the mean difference in expression between groups and PC1 rotations.
[1] 110
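The bare count above presumably came from a line like the following (hypothetical, but it applies the same thresholds as the filter just below):

```r
# Number of genes past both thresholds
sum(q2bdiff.mean$mean_diff > 1.3 & q2bdiff.mean$pca_rot > 0.055)
```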
q2bdiff.mean[order(-q2bdiff.mean$pca_rot),]%>%
filter(pca_rot>0.055 & mean_diff>1.3)%>%
kbl(
caption = "<center><strong>Mean Differences Between Diseased and Healthy and PC1 Rotations</strong></center>",
col.names = c("Gene","Mean Differences", "PC1 Rotation")) %>%
kable_classic(full_width = T, html_font = "Cambria")

| Gene | Mean Differences | PC1 Rotation |
|---|---|---|
| 502 | 2.544461 | 0.0948504 |
| 589 | 2.343491 | 0.0944977 |
| 565 | 2.470820 | 0.0918382 |
| 590 | 2.205324 | 0.0917317 |
| 600 | 2.747577 | 0.0916732 |
| 551 | 2.342845 | 0.0876836 |
| 593 | 2.402096 | 0.0875862 |
| 538 | 2.391979 | 0.0874540 |
| 584 | 2.601986 | 0.0869086 |
| 509 | 2.353193 | 0.0866102 |
| 511 | 2.313256 | 0.0865513 |
| 570 | 2.270646 | 0.0862646 |
| 599 | 2.413816 | 0.0862431 |
| 535 | 2.240220 | 0.0861009 |
| 528 | 1.992497 | 0.0856143 |
| 592 | 2.184274 | 0.0855861 |
| 566 | 2.121354 | 0.0855301 |
| 564 | 2.153702 | 0.0855207 |
| 540 | 2.545174 | 0.0855074 |
| 508 | 2.080937 | 0.0854137 |
| 554 | 2.436718 | 0.0853939 |
| 11 | 2.094096 | 0.0852155 |
| 548 | 2.257118 | 0.0851578 |
| 536 | 2.267047 | 0.0851402 |
| 563 | 1.920513 | 0.0849285 |
| 588 | 2.007210 | 0.0847297 |
| 555 | 2.043563 | 0.0845421 |
| 520 | 1.915491 | 0.0843101 |
| 522 | 2.146558 | 0.0841999 |
| 558 | 1.890814 | 0.0839016 |
| 574 | 2.189145 | 0.0836927 |
| 559 | 1.913408 | 0.0836331 |
| 530 | 1.908835 | 0.0835904 |
| 569 | 1.976160 | 0.0834494 |
| 514 | 2.074538 | 0.0833856 |
| 15 | 2.130626 | 0.0833707 |
| 561 | 2.400447 | 0.0831013 |
| 513 | 2.415343 | 0.0829114 |
| 16 | 1.979713 | 0.0827456 |
| 575 | 2.052460 | 0.0825844 |
| 505 | 2.135460 | 0.0825725 |
| 503 | 2.214049 | 0.0825674 |
| 549 | 2.550757 | 0.0820319 |
| 582 | 2.496084 | 0.0814645 |
| 539 | 1.860976 | 0.0813224 |
| 13 | 2.022959 | 0.0813182 |
| 526 | 1.973505 | 0.0812224 |
| 515 | 1.731815 | 0.0810760 |
| 516 | 2.256755 | 0.0810632 |
| 562 | 2.465549 | 0.0808857 |
| 568 | 2.519418 | 0.0808201 |
| 547 | 1.993645 | 0.0804744 |
| 527 | 1.956458 | 0.0804677 |
| 12 | 2.316605 | 0.0802807 |
| 571 | 2.139302 | 0.0802315 |
| 534 | 1.950123 | 0.0801885 |
| 579 | 1.975858 | 0.0797707 |
| 523 | 2.282517 | 0.0795457 |
| 541 | 2.023961 | 0.0794231 |
| 507 | 1.924713 | 0.0792445 |
| 550 | 1.903665 | 0.0791731 |
| 18 | 1.858797 | 0.0790526 |
| 504 | 1.887137 | 0.0786392 |
| 545 | 2.090968 | 0.0785524 |
| 529 | 2.380368 | 0.0776381 |
| 586 | 2.045067 | 0.0776283 |
| 533 | 2.028891 | 0.0775131 |
| 544 | 1.829716 | 0.0774975 |
| 521 | 2.028383 | 0.0773548 |
| 531 | 1.826845 | 0.0772049 |
| 595 | 1.846520 | 0.0769854 |
| 512 | 1.859789 | 0.0769505 |
| 596 | 2.005476 | 0.0767437 |
| 517 | 1.998792 | 0.0766804 |
| 546 | 2.263153 | 0.0766770 |
| 542 | 1.939726 | 0.0765169 |
| 587 | 2.051589 | 0.0761105 |
| 583 | 2.082980 | 0.0760889 |
| 598 | 1.817164 | 0.0758403 |
| 591 | 1.995296 | 0.0757735 |
| 557 | 1.617970 | 0.0756431 |
| 501 | 1.705006 | 0.0749214 |
| 580 | 1.552814 | 0.0743236 |
| 524 | 2.113309 | 0.0742861 |
| 572 | 2.001290 | 0.0742220 |
| 20 | 1.947313 | 0.0741622 |
| 553 | 1.661920 | 0.0737646 |
| 17 | 1.816367 | 0.0735899 |
| 532 | 1.653533 | 0.0735879 |
| 597 | 1.765276 | 0.0734921 |
| 519 | 1.447852 | 0.0732986 |
| 567 | 1.952287 | 0.0730010 |
| 556 | 1.850979 | 0.0725488 |
| 506 | 1.864122 | 0.0724897 |
| 577 | 1.805478 | 0.0719603 |
| 537 | 1.482005 | 0.0718737 |
| 576 | 1.973022 | 0.0718200 |
| 560 | 1.887658 | 0.0715854 |
| 14 | 1.817698 | 0.0714753 |
| 19 | 1.646883 | 0.0706200 |
| 552 | 1.626304 | 0.0692426 |
| 525 | 1.536728 | 0.0685103 |
| 543 | 1.790734 | 0.0682978 |
| 581 | 1.513992 | 0.0676986 |
| 585 | 1.556664 | 0.0663186 |
| 578 | 1.543637 | 0.0662102 |
| 594 | 1.498254 | 0.0650860 |
| 510 | 1.526232 | 0.0640798 |
| 518 | 1.500928 | 0.0619118 |
| 573 | 1.604079 | 0.0572774 |